The COPLE2 corpus: a learner corpus for Portuguese

نویسندگان

  • Amália Mendes
  • Sandra Antunes
  • Maarten Janssen
  • Anabela Gonçalves
چکیده

We present the COPLE2 corpus, a learner corpus of Portuguese that includes written and spoken texts produced by learners of Portuguese as a second or foreign language. The corpus includes at the moment a total of 182,474 tokens and 978 texts, classified according to the CEFR scales. The original handwritten productions are transcribed in TEI compliant XML format and keep record of all the original information, such as reformulations, insertions and corrections made by the teacher, while the recordings are transcribed and aligned with EXMARaLDA. The TEITOK environment enables different views of the same document (XML, student version, corrected version), a CQP-based search interface, the POS, lemmatization and normalization of the tokens, and will soon be used for error annotation in stand-off format. The corpus has already been a source of data for phonological, lexical and syntactic interlanguage studies and will be used for a data-informed selection of language features for each proficiency level.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Hedges in English for Academic Purposes: A Corpus-based study of Iranian EFL learners

Hedges, as tools to express tentativeness and doubt, have been studied in plenty of research papers in the Iranian EFL research setting. However, their use in a learner corpus, portraying Iranian learner English, is in need of more research attention. With this end in view, this study aimed at investigating how Iranian EFL learners who have majored in English-related fields in Iran deployed hed...

متن کامل

Metadiscourse Markers in a Corpus of Learner Language: The Case of Iranian EFL Learners

Different issues have been probed in learner corpus research since the late 1980s.However, taking the im- portance of meta discourse markers (MDMs) in signposting academic discourse, their use in Iranian EFL learners‟ academic essays is an area of research in need of a more serious analysis. Contributing to this line of investigation, this paper reports a corpus-based study of the use of MDMs i...

متن کامل

How textbooks (and learners) get it wrong: A corpus study of modal auxiliary verbs

Many  elements  contribute  to  the  relative  difficulty  in  acquiring  specific  aspects  of  English  as  a foreign  language  (Goldschneider  &  DeKeyser,  2001).  Modal  auxiliary  verbs  (e.g.  could,  might), are  examples  of  a  structure  that  is  difficult  for  many  learners.  Not  only  are  they  particularly complex  semantically,  but  especially  in  the  Malaysian  context ...

متن کامل

The Presence and Influence of English in the Portuguese Financial Media

As the lingua franca of the 21st century, English has become the main language for intercultural communication for those wanting to embrace globalization. In Portugal, it is the second language of most public and private domains influencing its culture and discourses. Language contact situations transform languages by the incorporations they make from other languages and Portugal has...

متن کامل

A Corpus-driven Food Science and Technology Academic Word List

The overarching goal of this study was to create a list of the most frequently occurring academic words in Food Science and Technology (FST). To this end, a 4,652,444-word corpus called Food Science and Technology Research Articles (FSTRA), which included 1,421 research articles (RAs) randomly selected from 38 journals across five sub-disciplines in FST, was developed. Frequency and range-based...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016